Week 02: Lexical Analysis & Finite Automata

Lexical Analysis

Token Class (or Class)

Classify program substrings according to role (token class), class corresponding to sets of strings:

  • Identifier : string of letters or digits, starting with a letter
  • Integer: a non-empty string of digits
  • Keyword: else, if, begin, ...
  • Whitespace: a non-empty sequence of blanks, newlines, tabs

Goal of Lexical Analysis

  • Definition
    • lexeme: A lexeme is a sequence of characters that matches the pattern for a token.
    • token: A token is a pair consisting of the token name and the value.
  • Concept
    • Parttition the input string into lexemes.
    • Identity the token of each lexeme.
    • Communicate tokens to the parser.
  • Input: Program Substrings
  • Output: Tokens
  • Sample Input: string (foo=42)
  • Sample Output: <class, "string">, <'('>, <ID, "foo">, <Operator, "=">, <"Int", "42>, <')'>
  • Remark
    • Left-to-Right scan => lookahead required.

Regular Languages

Regular Expression

  • Concept: Regular expressions (syntax) specify regular languages (set of strings).
  • Two base cases
    • Single Character: $'c' = \{ "c" \}$
    • Empty String: $\epsilon = \{ "" \}$
  • 3 compound expressions
    • Union: $A + B = \{a\ |\ a \in A\} \cup \{ b\ |\ b \in B \}$
    • Concatenation: $AB = \{ab\ |\ a \in A \wedge b \in B \}$
    • Iteration: $ A^* = \bigcup_{i \geq 0} A^i, \begin{cases} A^i = A...A \\ A^0 = \epsilon \end{cases} $

Def. The regular expression over $\Sigma$ are the smallest set of expressions including: $$ \begin{align} R &= \epsilon \\ &|\ 'c', c\in\Sigma\\ &|\ R+R\\ &|\ RR\\ &|\ R^* \end{align} $$

Formal Languages

Def. Let $\Sigma$ be a set of characters (an alphabet). A language over $\Sigma$ is a set of strings of characters drawn from $\Sigma$. Meaning function $L$ maps regular expressions to set of strings.

$$ \begin{align} L(\epsilon) &= \{ "" \} \\ L('c') &= \{ "c" \} \\ L(A+B) &= L(A) \cup L(B) \\ L(AB) &= \{ab\ |\ a \in L(A) \wedge b \in L(B)\} \\ L(A^*) &= \bigcup_{i \geq 0} L(A)^i \\ \end{align} $$

Q:Why use a meaning function?

  • Make clear what is syntax, what is semantics.
  • Allow us to consider notation as a separate issue.
  • Because expression and meanings are not 1-1.
    • Meaning is many to one.

Lexical Specifications

Lexemes

  • Keyword: "if" / "else" / "then" / ...
    • $ 'if' + 'else' + 'then' + ...$
  • Integer: a non-empty string of digits
    • $digits: '0' + '1' + '2' + ...$
    • $(digit)(digit)^* = digit^+$
  • Identifier: strings of letters or digits, starting with a letter
    • $letter = 'a' + 'b' + 'c' + ... = [a-zA-Z]$
    • $(letter)(digit+letter)^*$
  • Whitespace: a non-empty sequence of blanks, newlines, and tabs
    • $('\;' + '\setminus n' + '\setminus t')^+$

More regular expressions

  • At least one: $A^+ = AA^*$
  • Union: $A|B = A+B$
  • Option: $A? = A+\epsilon$
  • Range: $'a'+'b'+\dots+'z' = [a-z]$
  • Excluded Range: $\widehat{[a-z]} = [\wedge a-z]$

How do we check $program \in L(R)$?

  1. Write a regular expression for the lexemes of each token class
    • Number $= digit^+$
    • Keyword $= 'if'+'else'+\dots$
    • Identifiew $= letter(letter+digit)^*$
    • OpenPar $='('$
    • ...
  2. Construct $R$ to match all lexemes.
    • $R = Number + Keyword + Number + ... = R_1 + R_2 + ...$
  3. Let input be $x_1...x_n$. Find the longest length $i$ such that $x_1...x_i \in L(R)$.
  4. Remove $x_1...x_i$. Go to step 3.
  5. If $x_1...x_i \in L(R_a) \cap L(R_b)$, apply $R_{\min(a,b)}$.

In [ ]: